support GLM-4.5V vision model #16600
Conversation
need `clip.vision.rope.freq_base` for GLM-4.5V
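As a rough illustration (not code from this PR), that key could be emitted from `set_gguf_parameters()` of the mmproj model class roughly like this. The `"rope_theta"` field in `hparams_vision` and the fallback value are assumptions; `add_float32` is gguf-py's generic float metadata writer, and the exact wiring in the final PR may differ:

```python
def set_gguf_parameters(self):
    super().set_gguf_parameters()
    # Sketch only: "rope_theta" is an assumed field name in the vision config,
    # and 10000.0 is a placeholder default, not a value taken from GLM-4.5V.
    rope_theta = float(self.hparams_vision.get("rope_theta", 10000.0))
    self.gguf_writer.add_float32("clip.vision.rope.freq_base", rope_theta)
```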
So, it turns out that vision in this model is based on Qwen3-VL, which still needs support from llama.cpp. I am pretty familiar with llama.cpp in general, but not with …. I also just saw this thread (#16207), in which someone has posted a patch to get Qwen3-VL kinda-sorta-working in llama.cpp. I will take a look at that too and see if it is helpful; it might make more sense to get Qwen3-VL to a working state in llama.cpp first, and only then build this PR on top of that. Not sure, just thinking out loud.
Thanks for your work! @ddh0
Here's my silly implementation for the unfinished mmproj part (…):

```python
# Methods for the new mmproj model class in convert_hf_to_gguf.py
# (Tensor, Iterable and gguf are already imported at the top of that file).

def __init__(self, *args, **kwargs):
    super().__init__(*args, **kwargs)
    assert self.has_vision_encoder
    assert self.hparams_vision is not None
    # the vision config uses "num_heads" / "depth" instead of the usual key names
    self.hparams_vision["num_attention_heads"] = self.hparams_vision.get("num_heads")
    self.hparams_vision["num_hidden_layers"] = self.hparams_vision.get("depth")

def set_gguf_parameters(self):
    # keep ddh0's existing code here as is
    ...

def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]:
    del bid  # unused
    if name.startswith("model.visual."):
        name = name.replace("model.visual.", "visual.", 1)
        if ".qkv." in name:
            # split the fused qkv tensor into separate q / k / v tensors
            if data_torch.ndim == 2:  # weight
                c3, _ = data_torch.shape
            else:  # bias
                c3 = data_torch.shape[0]
            assert c3 % 3 == 0
            c = c3 // 3
            wq = data_torch[:c]
            wk = data_torch[c:c * 2]
            wv = data_torch[c * 2:]
            return [
                (self.map_tensor_name(name.replace("qkv", "q")), wq),
                (self.map_tensor_name(name.replace("qkv", "k")), wk),
                (self.map_tensor_name(name.replace("qkv", "v")), wv),
            ]
        if name.startswith("visual.downsample."):
            # questionable: map the downsample tensors onto V_POST_NORM for now
            suffix = name.split(".", 2)[2]
            new_name = self.format_tensor_name(gguf.MODEL_TENSOR.V_POST_NORM, suffix="." + suffix)
            return [(new_name, data_torch)]
        return [(self.map_tensor_name(name), data_torch)]
    else:
        return []  # not a vision tensor; nothing to emit for the mmproj file
```

Then edit …
If you put these in the right place, running `python convert_hf_to_gguf.py /path/to/GLM-4.5V --outfile /path/to/GLM-4.5V-mmproj.gguf --mmproj` will succeed. But this is as far as I can help: my lack of knowledge of the model prevents me from digging further, and the converted mmproj won't work until we get it right. I also believe my treatment of … is probably not right. Maybe refer to …
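As a quick sanity check (my own suggestion, not something from the original comment), gguf-py's `GGUFReader` can dump the metadata keys and tensor names of the converted file, which helps verify whether the qkv split and the downsample mapping above produced the names the clip/mtmd loader expects. The path below is a placeholder:

```python
# Sketch: list metadata keys and tensor names in the converted mmproj GGUF.
from gguf import GGUFReader

reader = GGUFReader("/path/to/GLM-4.5V-mmproj.gguf")  # placeholder path
for field_name in reader.fields:
    print("key:", field_name)
for tensor in reader.tensors:
    print("tensor:", tensor.name, tensor.shape)
```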
Add support for zai-org/GLM-4.5V vision model to llama.cpp. I currently only plan to support images + text, no video inputs in this PR.
The architecture is `Glm4vMoeForConditionalGeneration` (`"model_type": "glm4v_moe"`). Internally, this consists of an LLM (text model) and a ViT (vision adapter / multimodal projector):

- LLM (text model, `glm4v_moe_text`)
  - tensor names are prefixed with `model.language_model.`
  - uses `apply_multimodal_rotary_pos_emb`, which applies rotary embeddings across temporal, height, and width dimensions for visual tokens
- ViT (vision adapter, `glm4v_moe`)
  - based on `Aimv2VisionModel`
  - tensor names are prefixed with `model.visual.` (see the sketch after this list)
  - uses a `Glm4vMoeVisionEmbeddings` module to handle varied image resolutions
  - applies rotary embeddings to visual tokens (`apply_rotary_pos_emb_vision`)
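To make the two prefixes concrete, here is a tiny, purely illustrative routing helper (not code from this PR); `route_tensor` and its return labels are made-up names, only the two prefixes come from the checkpoint:

```python
# Illustrative only: how checkpoint tensor names split between the text model
# conversion and the mmproj (vision) conversion, based on the prefixes above.
def route_tensor(name: str) -> str:
    if name.startswith("model.visual."):
        return "mmproj"  # ViT / vision adapter tensors
    if name.startswith("model.language_model."):
        return "text"    # LLM (text model) tensors
    return "other"       # anything else (e.g. top-level heads / embeddings)
```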
Other notes:
References:
See also: